Distributed processing of stream data on an event protocol

ABSTRACT

An exemplary method for distributed processing of streaming data on an event protocol comprises receiving a plurality of related events from the streaming data at a node, amending a state of the related events, determining an error margin based on the amended state, and updating a current data transformation based on the amended state and error margin, thereby enabling real time analysis of streaming data.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 11/974,113, filed on Oct. 11, 2007, entitled “Distributed Processing of Streaming Data on an Event Protocol,” and is hereby incorporated herein by reference in its entirety for all purposes.

BACKGROUND

MapReduce is a well-known programming model that enables development of scalable parallel applications to process a large amount of data on clusters of commodity computing machines. In general, through an interface with two functions, a map function and a reduce function, the MapReduce model can facilitate parallel implementation of many real-world tasks such as data processing for search engines.

A commodity machine (or a portion of a commodity machine) performing map functions may be referred to as a map node. Likewise, a commodity machine (or a portion of a commodity machine) performing reduce functions may be referred to as a reduce node. It is possible for a commodity machine to perform both map functions and reduce functions, depending on the required computations and the capacity of the machine. Typically, each map node has a one-to-one connection with a corresponding first-level reduce node. Multiple levels of reduce nodes are generally required for processing large data sets.

An initial large dataset to be processed in a MapReduce model is generally a fixed or static data set. A large fixed dataset is first divided into smaller data sets by the map nodes. The smaller data sets are then sent to first-level reduce nodes for performing a reduce function on the smaller data sets. The reduce functions generate a smaller set of values which will be re-reduced at the next levels until a final result or measurement is attained. A visual representation of a MapReduce architecture may comprise nodes arranged in a funnel shape wherein an initial data set is incrementally reduced at each level until a final result exits the tip of the funnel.

It would be beneficial to provide a system capable of distributed processing of dynamic or streaming data without needing multiple levels of reduce nodes.

SUMMARY

An exemplary method for distributed processing of streaming data on an event protocol comprises receiving a plurality of related events from the streaming data at a node, amending a state of the related events, determining an error margin based on the amended state, and updating a current data transformation based on the amended state and error margin, thereby enabling real time analysis of streaming data.

Other exemplary embodiments and implementations are disclosed herein.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 illustrates an exemplary system for distributed processing of streaming data.

FIG. 2 illustrates an exemplary reduce node for distributed processing of streaming data.

FIG. 3 illustrates an exemplary process for distributed processing of streaming data.

FIG. 4 illustrates an exemplary process performed by a reduce node.

FIG. 5 illustrates an exemplary process performed by a reduce node which receives related events for multiple keys.

FIG. 6 illustrates an exemplary implementation.

FIG. 7 illustrates another exemplary implementation.

DETAILED DESCRIPTION

The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of particular applications of the invention and their requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art and the general principles defined herein may be applied to other embodiments and applications without departing from the scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

I. Overview

Section II describes exemplary systems for distributed processing of streaming data.

Section III describes exemplary processes for distributed processing of streaming data.

Section IV describes exemplary implementations.

Section V describes an exemplary computing environment.

II. Exemplary Systems for Distributed Processing of Streaming Data

FIG. 1 illustrates an exemplary system 100 for distributed processing of streaming data. The exemplary system 100 includes a service 110, a plurality of storages 120a-120c, a plurality of data services 130a-130c, a plurality of map nodes 140a-140c, a plurality of reduce nodes 150a-150c, an aggregate node 160, and consumers 170. For ease of illustration and explanation, representative components are illustrated in FIG. 1. One skilled in the art will recognize that more or fewer components may be used depending on specific implementations.

The service 110 may be any entity which generates or otherwise obtains data that can be processed to determine a useful result or measurement. For example, without limitation, a service may be a call center which continuously receives call-related data such as customer data, operator performance data, product data, sales data, etc. Data streaming into the service 110 can be divided into events using an event protocol. Any event protocol known in the art can be implemented to process the streaming data. Exemplary event protocols include, without limitation, Java Message Service (JMS), Syslog, and Simple Mail Transfer Protocol (SMTP). In general, events are related by a key function (e.g., Key = f(Event)). Thus, events are related if they have the same key. Key calculations based on an event protocol are well known in the art and need not be described in more detail herein.

Storages 120 may be any form of storage, for example, databases, disk drives, read-only memories, and so forth. Data services 130 are services configured to fetch data from the data storages 120. In an exemplary implementation, a data service 130 typically has a web server enabled to obtain data over a communication network (e.g., the Internet) from one or more data storages 120 depending on a particular result to be achieved.

For each execution of the data-processing algorithm, map nodes 140 are configured to process data received from different data services as events and map the events into a format that is appropriate for the data processing and acceptable to the reduce nodes 150. In an exemplary implementation, events are mapped on a line-by-line basis into a limited number of formats. In addition, for each event received at a map node 140, the map node 140 determines a key for the event and sends related events (i.e., events having the same key) to the same reduce node 150. The map node 140 may or may not compute the key for each event.

In an exemplary implementation, the map nodes 140 are cross connected to the reduce nodes 150. That is, each map node 140 may connect to some or all reduce nodes 150 so that each map node 140 is enabled to send related events to the same reduce node 150. The map nodes 140 determine to which reduce node each event is sent based on the keys of the events.

For each key, a reduce node 150 applies a data transformation (e.g., a reduce function) to the related events to generate a result. The result may or may not be a final result. If the result is not final, an error margin may be calculated using any suitable error function known in the art. An interim result can be determined based on the error margin. Exemplary processes performed by a reduce node 150 are described below with reference to FIGS. 4 and 5.

The aggregate node 160 may or may not reside in the same computing device as the other nodes (i.e., map nodes or reduce nodes). In an exemplary implementation, the aggregate node 160 is configured to collect all the data from the reduce nodes 150 (e.g., from a cache) and provide an aggregated result to a requesting consumer 170.

In an exemplary implementation, a feedback loop is provided to enable any consumer feedback to the service 110.

FIG. 2 illustrates an exemplary reduce node 150. The exemplary reduce node 150 includes a data transformation module 210, a cache 220, an error calculation module 230 and a result generation module 240. Related events having the same key are received by the data transformation module 210. The data transformation module 210 fetches a current state (e.g., using the key) from the cache 220 and amends the state based on the received events, if necessary. If all the data for a given key have been received, then a final result is generated by the result generation module 240 based on an amended state associated with the key. If it is determined that more data will be received for a given key, then an error margin is calculated by the error calculation module 230. An interim result is generated by the result generation module 240 based on an amended state and an error margin associated with the key. The result, final or interim, is provided as an output from the reduce node 150.
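The division of labor among the modules of FIG. 2 can be sketched in Python as follows. This is only an illustrative sketch, not the disclosed implementation: the class name ReduceNode, the method process, the result dictionary layout, and the caller-supplied transform and error_fn callables are all assumptions introduced here, and the cache is modeled as a plain in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ReduceNode:
    # Hypothetical stand-ins for the modules of FIG. 2.
    transform: Callable[[Any, dict], Any]      # data transformation module 210
    error_fn: Callable[[Any], float]           # error calculation module 230
    initial_state: Any = 0
    cache: dict = field(default_factory=dict)  # cache 220: key -> current state

    def process(self, key: str, events: list, all_received: bool) -> dict:
        # Fetch the current state for this key and amend it per event.
        state = self.cache.get(key, self.initial_state)
        for event in events:
            state = self.transform(state, event)
        self.cache[key] = state

        # Result generation module 240: a final result carries no error
        # margin; an interim result carries the computed margin.
        margin: Optional[float] = None if all_received else self.error_fn(state)
        return {"key": key, "state": state,
                "error_margin": margin, "final": all_received}


# Usage: sum a hypothetical "value" field, with a constant placeholder margin.
node = ReduceNode(transform=lambda s, e: s + e["value"],
                  error_fn=lambda s: 0.5)
interim = node.process("k", [{"value": 2}, {"value": 3}], all_received=False)
```

Passing the transformation and error function in as callables keeps the sketch neutral about which functions a given job request would select.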

III. Exemplary Processes for Distributed Processing of Streaming Data

FIG. 3 illustrates an exemplary process for distributed processing of streaming data based on an event protocol.

At step 310, streaming data are obtained by a service 110.

At step 320, the data are stored in data storages 120.

At step 330, a job request from a consumer is received. In an exemplary implementation, any authorized entity may make a job request. The job request may be sent to one or more of the service 110, the map node 140, and/or any other components in the system 100, depending on design choice. In an exemplary implementation, a job request includes a data filter, a map task, and a reduce task. The data filter enables the map nodes 140 to obtain the particular data needed to satisfy the job request. The map task identifies the map function to be performed at the map nodes 140 to map data into a common format. The reduce task identifies the data transformation function(s) to be applied at a reduce node 150 that receives all (or substantially all) related events having the same key. The mechanics for composing a job request are well known in the art and need not be described in more detail herein.
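As an illustration of the three parts of a job request named above, the following Python sketch models a job request as a small record of three callables. The names JobRequest, run_map_side, and the record fields "campaign", "product_id", and "type" are hypothetical and chosen only for this example; the description does not prescribe any particular representation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class JobRequest:
    data_filter: Callable[[dict], bool]      # selects the data the job needs
    map_task: Callable[[dict], dict]         # maps a record to a common format
    reduce_task: Callable[[Any, dict], Any]  # transformation applied per key


def run_map_side(job: JobRequest, records: list) -> list:
    # Apply the data filter, then the map task, to each fetched record.
    return [job.map_task(r) for r in records if job.data_filter(r)]


# Hypothetical job: track call counts per product for one campaign.
job = JobRequest(
    data_filter=lambda r: r.get("campaign") == "spring",
    map_task=lambda r: {"key": r["product_id"], "type": r["type"]},
    reduce_task=lambda count, e: count + (-1 if e["type"] == "stop" else 1),
)
```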

At step 340, events are obtained from data storages 120 by data services 130 and sent to map nodes 140.

At step 350, the fetched data are divided into events in accordance with an event protocol and mapped to a common format by the map nodes 140.

At step 360, keys are determined for related events. Events are related if they have a common key. In an exemplary implementation, the map nodes 140 determine the keys based on the job request, the content of the events, and/or other information associated with the events. In another exemplary implementation, the map nodes 140 may obtain the keys from any other entity (not shown).

At step 370, related events are sent to the same reduce node 150. In general, events that are related have the same key. In an exemplary implementation, the map nodes apply a hash function (e.g., a distributed hash function or d-hash) to determine a hash value based on each key. The hash value determines to which reduce node each event is sent. As a result, events having the same key are routed to the same reduce node.
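Steps 360 and 370 can be sketched as follows. The sketch assumes a hypothetical key function that reads a "product_id" field, and it swaps in a SHA-256 digest taken modulo the number of reduce nodes in place of the distributed hash function mentioned above; both choices are assumptions made for illustration only.

```python
import hashlib


def event_key(event: dict) -> str:
    # Hypothetical key function Key = f(Event): use the product identifier.
    return str(event["product_id"])


def reduce_node_for(key: str, num_reduce_nodes: int) -> int:
    # A stable hash, so that every map node independently selects the same
    # reduce node for a given key.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reduce_nodes


def route(events: list, num_reduce_nodes: int) -> dict:
    # Group events by the index of the reduce node they are sent to.
    routed = {n: [] for n in range(num_reduce_nodes)}
    for event in events:
        node = reduce_node_for(event_key(event), num_reduce_nodes)
        routed[node].append(event)
    return routed
```

A cryptographic digest is used rather than Python's built-in hash() because the latter is randomized per process for strings, which would break agreement between map nodes running in separate processes.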

At step 380, data transformations are applied by the reduce nodes 150 to related events to determine a result. In an exemplary implementation, the result may be a final result. In another exemplary implementation, an interim result may be provided if additional streaming data are still to be received for a given key. One skilled in the art will recognize that the mathematical function to be applied as a data transformation depends on the nature of a job request (e.g., via problems involving cumulative transitive functions or transitive summary functions, etc.). Exemplary data transformations include, without limitation, OLAP cubes, probability distribution functions, threshold monitoring, etc.

FIG. 4 illustrates an exemplary process performed by a reduce node to update a current data transformation of related events sharing the same key.

At step 410, a reduce node 150 receives a plurality of related events. In an exemplary implementation, the related events are sent from one or more map nodes 140 that are cross connected to the reduce node 150.

At step 420, the reduce node 150 amends a state of the related events. In an exemplary implementation, the reduce node 150 uses the common key of the related events to fetch a current state from a local or remote cache 220. The reduce node 150 then determines, based on the content of each related event, whether to amend the current state to an amended state.

At step 430, the reduce node 150 determines an error margin based on the amended state. Any known error functions may be applied, depending on design choice and/or the nature of the job request, to determine a distribution of error.

At step 440, the reduce node 150 updates a current data transformation based on the amended state and the error margin. In an exemplary implementation, the reduce node 150 determines a result, final or interim, as the new data transformation and updates the current data transformation based on the new result.

FIG. 5 illustrates an exemplary process performed by a reduce node receiving multiple sets of related events, wherein each set of events shares the same key.

At step 510, a plurality of keys is obtained.

At step 520, the first key, K, is set to equal to

At step 530, a data transformation is applied to the events having the key K. In an exemplary implementation, a data transformation is applied to determine one or more values germane to a job request. For example, the result of a data transformation may be an amended state for the key K.

At step 540, whether or not all events having the same key K have been received is determined.

If yes, at step 550, a final result is determined based on the data transformation and reduce function, and the process continues at step 580.

If no, at step 560, an error margin is determined based on the applied data transformation. All data transformations have known error margin functions. Exemplary error margin functions include, without limitation, weighted distance determinators and breadth first leveling.

At step 570, an interim result is determined based on the data transformation and the error margin.

At step 580, whether events for all keys have been processed is determined.

If yes, the process ends.

If not, at step 590, K is set equal to K_(i+1) and the process repeats at step 530 for all events having the key K_(i+1).
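The per-key loop of FIG. 5 can be sketched in Python as shown below. The function name process_keys, its parameters, and the tuple layout of each result are assumptions made for this illustration; in particular, the set complete_keys stands in for whatever mechanism determines at step 540 that all events for a key have been received, which is not specified here.

```python
def process_keys(events_by_key, complete_keys, transform, error_fn):
    """Apply a transformation per key and label each result final or interim.

    events_by_key: dict mapping each key to its list of related events
    complete_keys: set of keys for which all events have been received
    transform:     callable (state, event) -> new state
    error_fn:      callable (state) -> error margin
    """
    results = {}
    for key, events in events_by_key.items():   # steps 520 and 590
        state = None
        for event in events:                     # step 530
            state = transform(state, event)
        if key in complete_keys:                 # step 540
            results[key] = ("final", state, None)        # step 550
        else:
            margin = error_fn(state)                     # step 560
            results[key] = ("interim", state, margin)    # step 570
    return results                               # step 580: all keys done


def count_events(state, event):
    # Example transformation: count the events seen for a key.
    return (state or 0) + 1
```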

IV. Exemplary Implementations

FIGS. 6-7 illustrate exemplary implementations for distributed processing of streaming data to obtain useful results.

1. First Exemplary Implementation

FIG. 6 illustrates an exemplary process for distributed processing of streaming data for obtaining a result for a problem that does not require keeping a state of related events (e.g., problems involving cumulative transitive functions).

At step 610, an event is obtained at a map node. In an exemplary implementation, the event is fetched by the map node in response to a job request from a consumer. In this exemplary implementation, the consumer may wish to determine, for a given campaign, how many simultaneous calls are occurring at any moment. The campaign may have an associated product identifier.

At step 620, a key is determined. In an exemplary implementation, the map node may use the product identifier as the key.

At step 630, a hash value, H, is determined based on the key. Hash functions are well known in the art and need not be described in more detail herein.

At step 640, the event is routed to a reduce node identified by the hash value H (reduce node H). In an exemplary implementation, related events having the same key are routed to the same reduce node.

At step 650, based on the key of the event, the reduce node H obtains an event count from a cache.

At step 660, the reduce node H determines whether the event is a stop event (i.e., whether the call associated with the event has ended).

If yes, at step 670, the reduce node H subtracts 1 from the event count.

If no, at step 680, the reduce node H adds 1 to the event count.

At step 690, the reduce node updates an error margin and saves the cumulative count value in the cache. An error margin is calculated based on any known error function which calculates the change between complete and incomplete call sessions. In an exemplary implementation, the cache is accessible to an aggregator node which may output results to a consumer.
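The counting logic of steps 650 through 690 can be sketched as follows. The function name handle_call_event and the event fields "product_id" and "type" are hypothetical, the cache is modeled as a dictionary, and the hash routing of steps 630 and 640 is omitted. The error margin update of step 690 is also left out, since no particular error function is fixed by the description.

```python
def handle_call_event(cache: dict, event: dict) -> int:
    key = event["product_id"]      # step 620: product identifier as the key
    count = cache.get(key, 0)      # step 650: fetch the event count
    if event["type"] == "stop":    # step 660: has the call ended?
        count -= 1                 # step 670
    else:
        count += 1                 # step 680
    cache[key] = count             # step 690: save the cumulative count
    return count


# Usage: three calls start and one ends, leaving two simultaneous calls.
cache = {}
for event_type in ["start", "start", "stop", "start"]:
    current = handle_call_event(
        cache, {"product_id": "P1", "type": event_type})
```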

The process described above is merely exemplary. One skilled in the art will recognize that other results or measurements may be obtained by processing the streaming data. For example, a consumer may be able to determine the minimum, maximum and/or average counts of calls over any period of time. In this example, a reduce node may adjust a timestamp average as new events arrive.

2. Second Exemplary Implementation

FIG. 7 illustrates an exemplary process for distributed processing of streaming data for obtaining a result for a problem that requires keeping a state of related events (e.g., problems involving transitive summary functions).

At step 710, an event is obtained at a map node. In an exemplary implementation, the event is fetched by the map node in response to a job request from a consumer. In this exemplary implementation, the consumer may wish to determine, for a given service, the average completed call sessions over a period of time T.

At step 720, a key is determined. In an exemplary implementation, the map node may use the service name as the key.

At step 730, a hash value, H, is determined based on the key. Hash functions are well known in the art and need not be described in more detail herein.

At step 740, the event is routed to a reduce node identified by the hash value H (reduce node H). In an exemplary implementation, related events having the same key are routed to the same reduce node.

At step 750, based on the key of the event, the reduce node H obtains the current average completed call sessions over time T from a cache.

At step 760, the reduce node adjusts the average and time based on the event.

In this particular example, at step 770, the error margin is set to zero because there is no uncertainty at any given time whether a call is on-going or has completed.

At step 780, the reduce node saves the average value and time in the cache. In an exemplary implementation, the cache is accessible to an aggregator node which may output results to a consumer.
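One possible reading of steps 750 through 780 is sketched below. It assumes, purely for illustration, that the average is the number of completed call sessions divided by the time elapsed since the first event for the service, and that events carry hypothetical "service", "type", and "timestamp" fields. The description does not define the average this precisely, so the formula should be treated as an assumption rather than the disclosed computation.

```python
def update_average(cache: dict, event: dict) -> dict:
    key = event["service"]                        # step 720: service name
    state = cache.get(key)                        # step 750: fetch state
    if state is None:
        state = {"start": event["timestamp"], "completed": 0,
                 "time": 0.0, "average": 0.0}
    if event["type"] == "stop":
        state["completed"] += 1                   # one more completed session
    # Step 760: adjust the elapsed time and the average (assumed formula).
    state["time"] = event["timestamp"] - state["start"]
    if state["time"] > 0:
        state["average"] = state["completed"] / state["time"]
    state["error_margin"] = 0.0                   # step 770
    cache[key] = state                            # step 780: save to cache
    return state
```

For example, two completed sessions observed over 20 time units give an average of 0.1 completed sessions per time unit under this assumed definition.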

The process described above is merely exemplary. One skilled in the art will recognize that other results or measurements may be obtained by processing the streaming data.

3. Other Exemplary Implementations

The exemplary implementations described above are merely illustrative. One skilled in the art will recognize that other applications may be implemented to generate results based on distributed processing of streaming data. For example, in a call center service scenario, the system can be used to solve problems relating to monitoring, real time billing, real time analysis, dynamic service, real time quality assurance, dynamic load balancing, script comparison, performance analysis, and/or other problems that involve data processing.

Of course, the invention is not limited to call center services. It may be implemented in any service wherein real time analysis is useful for supporting operations and/or business objectives. For example, the system may be implemented to compute entire body systemic changes during an on-going surgery, detect heat changes by faults to predict earthquakes, determine stock trending, and/or other implementations.

V. Exemplary Operating Environments

The program environment in which a present embodiment of the invention is executed illustratively incorporates a general-purpose computer or a special purpose device such as a hand-held computer. Details of such devices (e.g., processor, memory, data storage, display) may be omitted for the sake of clarity.

It should also be understood that the techniques of the present invention may be implemented using a variety of technologies. For example, the methods described herein may be implemented in software executing on a computer system, or implemented in hardware utilizing either a combination of microprocessors or other specially designed application specific integrated circuits, programmable logic devices, or various combinations thereof. In particular, the methods described herein may be implemented by a series of computer-executable instructions residing on a suitable computer-readable medium. Suitable computer-readable media may include volatile (e.g., RAM) and/or non-volatile (e.g., ROM, disk) memory.

The foregoing embodiments of the invention have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the invention to the forms disclosed. Accordingly, the scope of the invention is defined by the appended claims, not the preceding disclosure.

What is claimed is:
1. A method for distributed processing of streaming data on an event protocol, comprising: (a) receiving a plurality of related events from said streaming data at a node, said streaming data including data being dynamically collected while the data are being processed; (b) amending a state of said related events based on a common key function of the related events; (c) determining an error margin based on the amended state, said error margin representing an uncertainty in the streaming data as a result of a portion of said data being from on-going phone calls; and (d) updating a current data transformation based on the amended state and error margin, thereby enabling real time analysis of streaming data.